Working effectively with data in R is greatly simplified by the tidyverse, a collection of packages designed for data science. The tidyverse provides a consistent framework for data manipulation, visualization, and modeling, which helps students learn generalizable skills rather than package-specific tricks.
A central concept in the tidyverse is the tibble, a modern re-imagining of the traditional R data frame. Tibbles keep the familiar two-dimensional table structure but introduce improvements such as preserving variable types, supporting list columns, and displaying data more cleanly in the console. These features make them easier to use in practice, especially with large datasets.
Finally, the idea of tidy data lies at the heart of the tidyverse. According to Hadley Wickham’s definition, tidy data means each variable forms a column, each observation forms a row, and each type of observational unit forms a table. Tidy data creates a standardized structure that enables smooth use of functions across the tidyverse, reducing the need for ad hoc data reshaping and making analyses more transparent and reproducible.
```r
vignette("tibble")
```

Tibbles are data frames that enhance the regular “old” data frame from base R. They keep the same two-dimensional tabular structure but are designed to be more consistent, predictable, and user-friendly.
There are several ways to create tibbles depending on the source of the data.
The simplest way is to build a tibble directly from vectors using tibble(). Inputs of length 1 are automatically recycled, and you can refer to variables you just created:
```r
tibble(
  x = 1:5,
  y = 1,
  z = x ^ 2 + y
)
```

You can convert existing data structures into tibbles with as_tibble():
```r
# From a data frame
head(as_tibble(iris))

# From a list
as_tibble(list(x = 1:3, y = letters[1:3]))
```

Packages in the tidyverse ecosystem return tibbles when reading data from files:
```r
# From CSV, TSV, or delimited text file
head(readr::read_csv("data/Chetty_2014.csv"))

# From Excel files
# readxl::read_excel("data.xlsx")
```

You can also obtain tibbles when working with databases using packages such as DBI and dbplyr:
```r
library(DBI)
library(dplyr)  # provides tbl()

con <- dbConnect(RSQLite::SQLite(), "mydb.sqlite")
tbl(con, "tablename")  # returns a tibble-like object
```

A tribble is a transposed tibble, designed for small data entry in code. Column headings are defined by formulas that start with ~:
```r
tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)
```

The tibble and readr packages are part of the core tidyverse, so they load automatically with library(tidyverse). Other packages such as readxl and dbplyr belong to the tidyverse ecosystem. They follow the same principles and return tibbles, but you need to load them explicitly.
Tibbles make fewer automatic changes than base R data frames:

- They never convert character vectors to factors.
- They never change the names of variables.
- They never create row names.
- They allow non-syntactic column names, such as names containing spaces or starting with a number.

For example:
```r
tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)
```

These column names would not be valid in base R, but are allowed in a tibble.
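To refer to such non-syntactic names later, surround them with backticks. A minimal sketch, rebuilding the same tb tibble as above:

```r
library(tibble)

tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)

# Non-syntactic names must be quoted with backticks when used with $
tb$`:)`       # "smile"

# [[ accepts the name as an ordinary string instead
tb[["2000"]]  # "number"
```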
There are two main differences between tibbles and base R data frames: printing and subsetting.

Printing

Tibbles have a refined print method that shows only the first 10 rows and only as many columns as fit on the screen:
```r
tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
) |> head()
```

This design avoids overwhelming the console when printing large data frames.
If you need more output, you can adjust options:
- print(n = , width = ) controls the number of rows and columns for a single print call.
- options(tibble.print_max = n, tibble.print_min = m) sets the default: if there are more than n rows, print only m.

```r
options(tibble.print_min = Inf)  # always show all rows
options(tibble.width = Inf)      # always print all columns
```

Subsetting

Most of the subsetting tools we have used so far generally subset the entire data frame. To pull out just a single variable or value, we can use $ and [[:

- [[ extracts by name or position
- $ extracts by name with less typing
```r
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

# Extract by name
df$x
#> [1] 0.1749291 0.5200598 0.2591938 0.3329369 0.6937035
df[["x"]]
#> [1] 0.1749291 0.5200598 0.2591938 0.3329369 0.6937035

# Extract by position
df[[1]]
#> [1] 0.1749291 0.5200598 0.2591938 0.3329369 0.6937035
df[[1, 1]]
#> [1] 0.1749291
```
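Note the difference between [ and [[: single brackets return another tibble, while [[ (and $) drop down to the underlying vector. A small self-contained sketch:

```r
library(tibble)

df <- tibble(x = 1:5, y = letters[1:5])

df["x"]    # a one-column tibble
df[["x"]]  # the underlying integer vector: 1 2 3 4 5

is_tibble(df["x"])     # TRUE: [ preserves the tibble structure
is.integer(df[["x"]])  # TRUE: [[ simplifies to a vector
```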
Structuring datasets to facilitate analysis is at the core of the principles of tidy data, as described by Hadley Wickham.
Tidy data follows three basic rules:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

When these rules are not followed, the dataset is considered untidy. Common signs of untidy data include:

- Column headers that are values rather than variable names.
- Multiple variables stored in a single column.
- Variables stored in both rows and columns.
Most data encountered in practice will be untidy. This is because most people are not familiar with the principles of tidy data, and data is often organised to facilitate uses other than analysis, such as making entry easier.
Two common problems to look for are:

- One variable is spread across multiple columns.
- One observation is scattered across multiple rows.
Usually, a dataset will only suffer from one of these problems.
To resolve them, the tidyr package provides two key functions:

- pivot_longer()
- pivot_wider()

These functions are illustrated with example datasets included in the tidyr package. The tables (table2, table4a) contain data on the number of tuberculosis (TB) cases recorded in different countries for the years 1999 and 2000. The variable cases represents the number of TB cases reported for a given country, year, and type of measure.
A common problem is a dataset where some of the column names are not variable names, but values of a variable:
```r
table4a
```

To tidy a dataset like this, pivot the offending columns into a new pair of variables.
Steps:

1. Identify the columns whose names are values, not variables (here, 1999 and 2000).
2. Choose the name of the variable the column names should move to (year).
3. Choose the name of the variable the cell values should move to (cases).

```r
table4a |>
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
```

In the final result, the pivoted columns are dropped, and new year and cases columns are created. Other variables, such as country, are preserved. The cases column now explicitly records the number of TB cases for each year and country.
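The same pivot can be reproduced on a small hand-made table; the toy data below are made up for illustration and are not the real table4a values:

```r
library(tibble)
library(tidyr)

toy <- tribble(
  ~country, ~`1999`, ~`2000`,
  "A",       10,      20,
  "B",       30,      40
)

tidy <- toy |>
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

tidy  # 4 rows: one per country-year combination

# year is a character column by default; recent tidyr versions can
# convert it during the pivot with names_transform:
# pivot_longer(..., names_transform = list(year = as.integer))
```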
Function pivot_wider() is the opposite of pivot_longer(). You use it when an observation is scattered across multiple rows.
```r
table2
```

To tidy this up, analyze the representation in a similar way to pivot_longer():
```r
table2 |>
  pivot_wider(names_from = type, values_from = count)
```

In this result, values of type (cases and population) become separate columns, and their associated numbers from count fill in the values. This produces a clearer dataset where each row corresponds to a country and year with distinct variables for cases and population.
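As with pivot_longer() above, the operation is easy to see on a small made-up table in the shape of table2:

```r
library(tibble)
library(tidyr)

long <- tribble(
  ~country, ~year, ~type,        ~count,
  "A",      1999,  "cases",      10,
  "A",      1999,  "population", 1000,
  "A",      2000,  "cases",      20,
  "A",      2000,  "population", 1100
)

wide <- long |>
  pivot_wider(names_from = type, values_from = count)

wide  # 2 rows: one per country-year, with cases and population columns
```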
The separate() function is used to pull apart one column into multiple columns by splitting wherever a separator character appears. This is useful when a single column actually contains more than one variable.
Consider the dataset table3 included in the tidyr package:
```r
table3
```

Notice the rate column. It contains two variables combined into a single column: the number of cases and the population size, separated by a forward slash. To make the dataset tidy, these should be split into separate variables.
The separate() function takes the name of the column to split and the names of the new columns to create:
```r
table3 %>%
  separate(rate, into = c("cases", "population"))
```

This produces two new columns, cases and population, replacing the original rate column. Note that separate() leaves the new columns as character vectors by default; to convert them to their natural types (here, integers), add convert = TRUE.
By default, separate() splits values wherever it sees a non-alphanumeric character, meaning any character that is not a number or letter. In the example above, it automatically detected and split at the forward slash.
If you want to be explicit, you can specify the character to split on with the sep argument:
```r
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")
```

This ensures that the column is split exactly where expected, giving you clearer control over the separation process.
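Both points can be combined in a self-contained sketch (the toy data below are illustrative, not the real table3):

```r
library(tibble)
library(tidyr)

toy <- tibble(
  country = c("A", "B"),
  rate    = c("745/19987071", "37737/172006362")
)

res <- toy |>
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# convert = TRUE re-guesses the column types, so cases and
# population come back as integers rather than character strings
res
```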
The unite() function is the inverse of separate(). It combines multiple columns into a single column. This can be useful when two or more variables are stored in separate columns but logically belong together.
Consider the dataset table5:
```r
table5
```

In this table, the year of observation is split into two columns, century and year. To make the dataset easier to work with, we can combine these into a single column.
```r
table5 %>%
  unite(new, century, year)
```

By default, unite() places an underscore (_) between the values from different columns. In this case, that would produce values like 19_99.
If we want the numbers to run together without any separator, we can control this with the sep argument:
```r
table5 %>%
  unite(new, century, year, sep = "")
```

This produces a single column new with values such as 1999 and 2000, giving a cleaner representation of the year variable.
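By default unite() also drops the input columns; setting remove = FALSE keeps them alongside the combined column. A minimal sketch on made-up data:

```r
library(tibble)
library(tidyr)

toy <- tibble(century = c("19", "20"), year = c("99", "00"))

res <- toy |>
  unite(new, century, year, sep = "", remove = FALSE)

res  # new = "1999", "2000", with century and year still present
```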
Many datasets consist of multiple tables that are connected in a meaningful way. Together, such collections are called relational data. Each table stores information about a particular type of observation, and the relationships among these tables allow us to draw richer conclusions.
Why not store everything in one giant table? Because different kinds of observations naturally belong in different tables. For example, aircraft information does not change across flights, and weather data apply to all flights departing at a specific time from the same airport.
Keeping separate tables avoids duplication, reduces storage, and prevents inconsistencies when information changes. If the manufacturer name of a plane changes, updating one record in the planes table is enough, rather than updating thousands of flight records.
We use the nycflights13 package as an example again to illustrate. This package includes several tables describing all flights departing New York City in 2013.
```r
library(nycflights13)

flights %>% select(year:day, hour, origin, dest, tailnum, carrier) %>% head()
airlines %>% head()
planes %>% select(tailnum, manufacturer, model, year) %>% head()
airports %>% select(faa, name, lat, lon) %>% head()
weather %>% select(origin, year:day, hour, temp, wind_speed) %>% head()
```

Each table contains different but related information:

- flights records each departure, including date, time, origin, destination, tail number, and carrier code.
- airlines provides the full airline names corresponding to carrier codes.
- airports gives the name and geographic location for each airport.
- planes contains aircraft details such as manufacturer and model.
- weather records hourly weather data for each origin airport.

These tables are linked by shared variables. For example:

- carrier links flights to airlines.
- tailnum links flights to planes.
- origin and dest link flights to airports.
- year, month, day, hour, and origin link flights to weather.

A natural question to ask is: how can we combine information across these tables to answer questions that draw on more than one of them?
To answer these questions, we must understand keys and joins.
A key identifies how observations in one table relate to those in another.
For example, in planes, the variable tailnum serves as a primary key because each aircraft has a unique tail number. In flights, the same variable acts as a foreign key since many flights can share the same aircraft.
We can verify whether a variable is a primary key by checking for uniqueness:
```r
planes %>% count(tailnum) %>% filter(n > 1)
```

The absence of duplicates implies that tailnum is a valid primary key. To check for unmatched foreign keys:

```r
flights %>% anti_join(planes, by = "tailnum") %>% nrow()
#> [1] 52606
```

The result shows how many flights lack corresponding aircraft information.
It’s common for foreign keys to have missing matches. This may happen when aircraft were retired, renamed, or missing from the record. Always check unmatched keys before joining tables.
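The same checks work on any pair of tables. A self-contained sketch with made-up tables (the names mirror, but are not, the nycflights13 data):

```r
library(tibble)
library(dplyr)

planes_toy  <- tibble(tailnum = c("N1", "N2", "N3"),
                      model   = c("A", "B", "C"))
flights_toy <- tibble(tailnum = c("N1", "N1", "N2", "N9"))

# Primary key check: no tailnum should appear more than once in planes_toy
planes_toy |> count(tailnum) |> filter(n > 1)   # zero rows -> valid key

# Foreign key check: flights with no matching aircraft record
flights_toy |> anti_join(planes_toy, by = "tailnum")  # the "N9" row
```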
Mutating joins combine variables from two tables based on matching key values. They allow us to enrich one dataset with information from another.
There are four types of mutating joins:

- inner_join(x, y): keeps only rows with matching keys in both tables.
- left_join(x, y): keeps all rows from x, adding matches from y.
- right_join(x, y): keeps all rows from y, adding matches from x.
- full_join(x, y): keeps all rows from both tables.

```r
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

flights2 %>% left_join(planes, by = "tailnum")
```

This left join keeps every flight and adds aircraft details whenever a matching tailnum exists. Unmatched rows contain missing values.
```r
flights2 %>% left_join(airlines, by = "carrier")
```

Now each carrier code is paired with a readable airline name, making the dataset easier to interpret.
left_join() is most common because it preserves the main dataset. Use inner_join() only when you are sure you want to exclude records that do not match.
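The difference between the join types is easiest to see on tiny made-up tables:

```r
library(tibble)
library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

inner_join(x, y, by = "key")  # 2 rows: only keys 1 and 2 match
left_join(x, y, by = "key")   # 3 rows: key 3 kept, with NA for val_y
full_join(x, y, by = "key")   # 4 rows: keys 1, 2, 3, and 4
```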
Filtering joins keep or exclude observations in one table based on whether they have a match in another. They do not add new columns.
- semi_join(x, y): keeps rows in x that have a match in y.
- anti_join(x, y): keeps rows in x that do not have a match in y.

```r
flights2 %>% semi_join(planes, by = "tailnum")
```

This keeps only flights that have aircraft information in planes.

```r
flights2 %>% anti_join(planes, by = "tailnum")
```

These flights are missing aircraft details. anti_join() is especially useful for finding unmatched keys and diagnosing data quality issues. For instance, you might use it to identify which flights are missing weather information or which airports have no corresponding records.
semi_join() and anti_join()
Use semi_join() to filter data based on membership, and anti_join() to find mismatches. They are efficient tools for quality control and exploratory checks.
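On the same kind of toy tables as above, the filtering joins keep or drop rows of x without ever adding columns from y:

```r
library(tibble)
library(dplyr)

x <- tibble(key = c(1, 2, 3), val = c("a", "b", "c"))
y <- tibble(key = c(1, 3))

semi_join(x, y, by = "key")  # rows with keys 1 and 3; still only x's columns
anti_join(x, y, by = "key")  # the row with key 2, which has no match in y
```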
Relational data represent multiple connected tables describing related entities. Keeping separate tables helps minimize redundancy, reduce errors, and maintain data consistency. Understanding how these tables are linked through keys allows us to integrate information accurately. The tidyverse join functions make it straightforward to enrich, filter, and explore relational datasets such as the nycflights13 example.